VQA benchmark for four Arabic dialects
Our testing reveals that leading open-source Arabic models struggle with dialect-specific tasks. Better dialect understanding can help models interpret contextual clues in text and images.
JEEM’s data structure
JEEM consists of 2196 annotated images distributed across four dialects:
Jordan (Levantine) — 606 images
Emirates (Gulf) — 150 images
Egypt (Egyptian) — 863 images
Morocco (Maghrebi) — 577 images
A smaller cross-cultural set has 100 images annotated by all four dialect teams for comparison.
Images cover a range of topics: transport, food and beverages, places, nature, sports, arts and culture, education, technology, and others.
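For readers who want to explore these splits programmatically, the sketch below shows one way to load the benchmark and count images per dialect, assuming the data is published on the Hugging Face Hub; the dataset ID, split name, and column name are illustrative assumptions, not the released schema.

```python
# Hypothetical loading sketch: the dataset ID "toloka/JEEM", the split name,
# and the "dialect" column are assumptions for illustration only.
from collections import Counter

from datasets import load_dataset

dataset = load_dataset("toloka/JEEM", split="test")  # hypothetical ID and split

# Count images per dialect; per the figures above, this should come out to
# roughly Jordan 606, Emirates 150, Egypt 863, Morocco 577.
print(Counter(example["dialect"] for example in dataset))
```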
Our data collection process
1. Region-specific images are sourced manually from open-source databases.
2. Annotator A describes the image in both MSA and their dialect (image caption).
3. Annotator B formulates questions in their dialect based only on the image caption.
4. Annotator C reviews the image, caption, and questions, then provides answers in their dialect.
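Taken together, these steps suggest a record shape roughly like the sketch below; the class and field names are our own illustration, not the released schema.

```python
# Schematic record for one JEEM sample, mirroring the annotation steps above;
# the class and field names are illustrative assumptions, not the released schema.
from dataclasses import dataclass, field

@dataclass
class JeemSample:
    image_path: str       # region-specific image sourced from an open-source database
    dialect: str          # "Jordan", "Emirates", "Egypt", or "Morocco"
    caption_msa: str      # Annotator A: caption in Modern Standard Arabic
    caption_dialect: str  # Annotator A: caption in the annotator's dialect
    questions: list[str] = field(default_factory=list)  # Annotator B: written from the caption alone
    answers: list[str] = field(default_factory=list)    # Annotator C: written after seeing the image
```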
Data samples
Cross-dialect data subset
100 images in the dataset are captioned by speakers of all four dialects for comparison. Some examples demonstrate narrow cultural contexts that are easily misinterpreted by Arabic speakers from other regions. In general, VLMs lack knowledge of regional nuances.
For example, this image of Omani halwa is interpreted as a different sweet depending on the region:
Jordanian: Traditional dessert... almonds... pistachios... karawya or dibs
طبق حلو تقليدي... اللوز... الفستق... بالكراوية أو الدبس
Emirati: Omani halwa
حلوى عمانية
Egyptian: Pudding... chocolate... pine seeds
لبودنج... شيكولاتة... صنوبر
Moroccan: Chocolate... caramel... coconut and pistachios
شكلاط... كراميل... بالكوكو و بيسطاش
Model performance
We ran comprehensive evaluations of the latest Arabic VLMs (Maya, PALO, Peacock, AIN, and AyaV) along with GPT-4o.
The evaluation process covered three types of metrics:
1. Surface-level and embeddings-based metrics (BLEU, CIDEr, ROUGE, BERTscore); see the sketch after this list
2. Human evaluation of image captioning
3. LLM-as-a-judge evaluation of image captioning and question answering
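As a rough illustration of the first category, the sketch below scores one prediction against one reference; the library choices (sacrebleu, rouge-score, bert-score) are our assumptions and CIDEr is omitted, so this is not necessarily the exact setup used in the paper.

```python
# Illustrative scoring of one caption against one reference; library choices
# are assumptions for this sketch, and CIDEr is omitted for brevity.
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def surface_metrics(prediction: str, reference: str) -> dict:
    bleu = sacrebleu.sentence_bleu(prediction, [reference]).score
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, prediction)["rougeL"].fmeasure
    # bert-score falls back to a multilingual model for Arabic when lang="ar"
    _, _, f1 = bert_score([prediction], [reference], lang="ar")
    return {"bleu": bleu, "rougeL": rouge_l, "bertscore_f1": f1.item()}
```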
Human and LLM-based evaluations focused on the same four criteria: Consistency, Relevance, Fluency, and Dialect Authenticity.
Correlation analysis showed strong agreement between LLM judgments and human judgments (see our paper).
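To make the LLM-as-a-judge stage concrete, here is a hedged sketch of a single judging call over the four criteria; the rubric wording, the 1-to-5 scale, and the judge model are our illustrative assumptions, not the paper's exact prompt.

```python
# Illustrative LLM-as-a-judge call; the rubric text, scale, and judge model
# are assumptions for this sketch, not the paper's exact configuration.
from openai import OpenAI

client = OpenAI()

RUBRIC = """You are judging an Arabic image caption written in the {dialect} dialect.
Reference caption: {reference}
Candidate caption: {candidate}

Score the candidate from 1 to 5 on each criterion:
- consistency: agreement with the reference content
- relevance: coverage of the salient image content
- fluency: natural, well-formed Arabic
- dialect_authenticity: faithfulness to the target dialect

Reply as a JSON object with those four keys."""

def judge_caption(reference: str, candidate: str, dialect: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": RUBRIC.format(
            reference=reference, candidate=candidate, dialect=dialect)}],
    )
    return response.choices[0].message.content  # JSON string with the four scores
```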
Summary of results

All VLMs struggle with low-resource dialects like Emirati.

GPT-4o performed best overall, but still has clear room for improvement.

All VLMs scored lower on JEEM than on English-based benchmarks.

Dialect authenticity is low across the board. For details, read our paper.
Key insights

JEEM uses everyday images to encourage natural writing and create a realistic test bed. It offers a novel resource for evaluating and improving vision-language models in real-world contexts.

Even frontier models struggle to process everyday language varieties, making them less accessible to many language communities.

LLM-as-a-judge approaches correlate well with human judgments, offering scalable and reliable evaluation of VLM performance.
Contributors